168 research outputs found

    Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

    We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa. Comment: 16 pages, 11 figures
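
For readers unfamiliar with the RF distance mentioned in this abstract, the following is a minimal sketch of the standard RF distance between two rooted, singly-labeled trees, computed as the size of the symmetric difference of their cluster sets. It is not the generalized mul-tree distance or the MulRF heuristic from the paper. The nested-tuple tree encoding and the function names `clusters` and `rf_distance` are assumptions made for illustration.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the leaf sets of all internal nodes except the root."""
    found: Set[FrozenSet[str]] = set()

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root_leaves = walk(tree)
    # The root cluster is the same for every tree on the same leaf set.
    found.discard(root_leaves)
    return found


def rf_distance(t1: Tree, t2: Tree) -> int:
    """Number of clusters present in exactly one of the two trees (unnormalized)."""
    return len(clusters(t1) ^ clusters(t2))


if __name__ == "__main__":
    balanced = (("A", "B"), ("C", "D"))
    caterpillar = ((("A", "B"), "C"), "D")
    print(rf_distance(balanced, caterpillar))
```

The sketch assumes both trees carry the same leaf labels, each used once. Some authors halve this count or work with bipartitions of unrooted trees instead of clusters.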

    Assessing systematic error in the inference of seed plant phylogeny

    We used parametric bootstrapping to assess the performance of maximum parsimony and maximum likelihood phylogenetic analyses of a 12-locus seed plant data set. Evidence of biases in maximum parsimony analyses of single-locus data sets may explain some of the locus-specific variation among DNA-based hypotheses of seed plant phylogeny. In particular, there is strong evidence of bias in maximum parsimony analyses, especially of plastid loci, that favors placing Gnetales sister to other seed plants. We concatenated simulated single-locus data sets to examine biases in analyses of a 12-locus data set in which each locus is simulated with different substitution parameters and branch lengths. Maximum parsimony analyses of the simulated 12-locus data set also show evidence of biases in favor of recovering trees with Gnetales sister to other seed plants and against recovering anthophyte, gnepine, and gnetifer trees. These biases are most evident in analyses that include the fastest-evolving characters. In the maximum likelihood analyses of the simulated 12-locus data sets, there is evidence of a bias against recovering the anthophyte hypothesis. Otherwise, there is little evidence that the heterogeneous branch lengths and substitution processes among loci influence the results from maximum likelihood phylogenetic analyses. © 2007 by The University of Chicago. All rights reserved

    Phylogenetic signal in nucleotide data from seed plants: Implications for resolving the seed plant tree of life

    Effects of taxonomic sampling and conflicting signal on the inference of seed plant trees supported in previous molecular analyses were explored using 13 single-locus data sets. Changing the number of taxa in single-locus analyses had limited effects on log likelihood differences between the gnepine (Gnetales plus Pinaceae) and gnetifer (Gnetales plus conifers) trees. Distinguishing among these trees also was little affected by the use of different substitution parameters. The 13-locus combined data set was partitioned into nine classes based on substitution rates. Sites evolving at intermediate rates had the best likelihood and parsimony scores on gnepine trees, and those evolving at the fastest rates had the best parsimony scores on Gnetales-sister trees (Gnetales plus other seed plants). When the fastest evolving sites were excluded from parsimony analyses, well-supported gnepine trees were inferred from the combined data and from each genomic partition. When all sites were included, Gnetales-sister trees were inferred from the combined data, whereas a different tree was inferred from each genomic partition. Maximum likelihood trees from the combined data and from each genomic partition were well-supported gnepine trees. A preliminary stratigraphic test highlights the poor fit of Gnetales-sister trees to the fossil data

    Assessing among-locus variation in the inference of seed plant phylogeny

    Large multilocus analyses can greatly reduce sampling error in phylogenetic estimates and help resolve difficult phylogenetic questions. Yet conventional multilocus analyses may be confounded by variation in the phylogenetic signal or processes of evolution among loci. We used nonparametric bootstrapping methods to examine locus-specific variation within a 12-locus seed plant data set and to examine the effects of this variation on estimates of seed plant phylogeny. The observed maximum likelihood and maximum parsimony bootstrap support from phylogenetic analyses of sites within single loci often notably differs from the bootstrap support obtained by sampling an equal number of sites from the concatenated 12-locus data set. This indicates heterogeneity among loci in the phylogenetic inference, and the differences among loci are not explained by the distribution of fast and slowly evolving sites. Bootstrap analyses that resample loci with replacement, rather than sampling individual sites with replacement, reveal extensive sampling variance among loci. The results suggest that seed plant phylogenetic analyses may not be robust to sampling error when only 12 loci are used and indicate a need for further investigation into the causes of the locus-specific variation. © 2007 by The University of Chicago. All rights reserved
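
The contrast this abstract draws between resampling sites and resampling loci can be sketched as follows. This is only an illustration of the two resampling schemes, not the authors' analysis pipeline. The dictionary-based alignment encoding and the function names `resample_sites`, `resample_loci`, and `concatenate` are assumptions, and every locus is assumed to contain the same taxa.

```python
import random
from typing import Dict, List

# Hypothetical encoding: one locus maps each taxon name to its aligned sequence.
Alignment = Dict[str, str]


def resample_sites(locus: Alignment, rng: random.Random) -> Alignment:
    """Site bootstrap: draw alignment columns with replacement."""
    n_sites = len(next(iter(locus.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in locus.items()}


def resample_loci(loci: List[Alignment], rng: random.Random) -> List[Alignment]:
    """Locus bootstrap: draw whole loci with replacement, leaving sites untouched."""
    return [rng.choice(loci) for _ in loci]


def concatenate(loci: List[Alignment]) -> Alignment:
    """Join loci end to end for each taxon."""
    taxa = list(loci[0].keys())
    return {t: "".join(locus[t] for locus in loci) for t in taxa}


if __name__ == "__main__":
    rng = random.Random(0)
    loci = [
        {"Gnetum": "ACGT", "Pinus": "ACGA"},
        {"Gnetum": "TTGCA", "Pinus": "TTGCC"},
    ]
    site_replicate = resample_sites(concatenate(loci), rng)
    locus_replicate = concatenate(resample_loci(loci, rng))
    print(site_replicate, locus_replicate)
```

A site replicate always has the same length as the original concatenated matrix, whereas a locus replicate can be longer or shorter because whole loci of different lengths are drawn.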

    Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence

    Background: Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results. Results: We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplications and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n^2 for searching TBR neighborhoods, where n is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses. Conclusions: The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.
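
As a rough illustration of the duplication cost that gene tree parsimony minimizes, the sketch below counts duplications with the standard LCA mapping: a gene tree node is a duplication when it maps to the same species tree node as one of its children. It does not implement the SPR or TBR neighborhood searches described in this abstract, and it ignores losses and deep coalescence. The nested-tuple encoding, the convention that gene tree leaves are labeled directly with species names, and the function names are all assumptions.

```python
from typing import FrozenSet, List, Tuple, Union

# Hypothetical encoding: a leaf is a species name, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def species_clusters(species_tree: Tree) -> List[FrozenSet[str]]:
    """Leaf set of every node (leaves included) of the species tree."""
    out: List[FrozenSet[str]] = []

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        out.append(leaves)
        return leaves

    walk(species_tree)
    return out


def lca_cluster(species: FrozenSet[str], clusters: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Smallest species tree cluster containing all given species (their LCA)."""
    return min((c for c in clusters if species <= c), key=len)


def count_duplications(gene_tree: Tree, species_tree: Tree) -> int:
    """Count gene tree nodes that map to the same species tree node as a child."""
    clusters = species_clusters(species_tree)
    dups = 0

    def walk(node: Tree) -> Tuple[FrozenSet[str], FrozenSet[str]]:
        nonlocal dups
        if isinstance(node, str):
            leaf = frozenset([node])
            return leaf, lca_cluster(leaf, clusters)
        results = [walk(child) for child in node]
        species = frozenset().union(*(s for s, _ in results))
        mapped = lca_cluster(species, clusters)
        if any(child_map == mapped for _, child_map in results):
            dups += 1
        return species, mapped

    walk(gene_tree)
    return dups


if __name__ == "__main__":
    species_tree = (("A", "B"), "C")
    gene_tree = (("A", "B"), ("A", "C"))
    print(count_duplications(gene_tree, species_tree))
```

Scanning every cluster for each gene tree node keeps the sketch short at the cost of speed. An error correction search of the kind the abstract describes would evaluate a cost like this over many rearranged gene tree topologies.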

    The Evolution of Organismal Complexity in Angiosperms as Measured by the Information Content of Taxonomic Descriptions

    We describe an information theoretic method for measuring relative organismal complexity. The complexity measure is based on the amount of information contained in formal taxonomic descriptions of organisms. We examine the utility of this measure for quantifying the complexity of plant families. The descriptions are subjective by nature, but we find a significant correlation in the complexity values of plant families from two independently authored sets of formal taxonomic descriptions. An analysis of the evolution of complexity across angiosperms found evidence of a pattern of increasing complexity. Our measure of complexity provides an operational definition of complexity that may be applied to any group of organisms and will enable further empirical studies of the evolution of complexity

    Inferring phylogenies with incomplete data sets: a 5-gene, 567-taxon analysis of angiosperms

    Background: Phylogenetic analyses of angiosperm relationships have used only a small percentage of available sequence data, but phylogenetic data matrices often can be augmented with existing data, especially if one allows missing characters. We explore the effects on phylogenetic analyses of adding 378 matK sequences and 240 26S rDNA sequences to the complete 3-gene, 567-taxon angiosperm phylogenetic matrix of Soltis et al. Results: We performed maximum likelihood bootstrap analyses of the complete, 3-gene, 567-taxon data matrix and the incomplete, 5-gene, 567-taxon data matrix. Although the 5-gene matrix has more missing data (27.5%) than the 3-gene data matrix (2.9%), the 5-gene analysis resulted in higher levels of bootstrap support. Within the 567-taxon tree, the increase in support is most evident for relationships among the 170 taxa for which both matK and 26S rDNA sequences were added, and there is little gain in support for relationships among the 119 taxa having neither matK nor 26S rDNA sequences. The 5-gene analysis also places the enigmatic Hydrostachys in Lamiales (BS = 97%) rather than in Cornales (BS = 100% in the 3-gene analysis). The placement of Hydrostachys in Lamiales is unprecedented in molecular analyses, but it is consistent with embryological and morphological data. Conclusion: Adding available, and often incomplete, sets of sequences to existing data sets can be a fast and inexpensive way to increase support for phylogenetic relationships and produce novel and credible phylogenetic hypotheses.

    Improved Heuristics for Minimum-Flip Supertree Construction

    The utility of the matrix representation with flipping (MRF) supertree method has been limited by the speed of its heuristic algorithms. We describe a new heuristic algorithm for MRF supertree construction that improves upon the speed of the previous heuristic by a factor of n (the number of taxa in the supertree). This new heuristic makes MRF tractable for large-scale supertree analyses and allows the first comparisons of MRF with other supertree methods using large empirical data sets. Analyses of three published supertree data sets with between 267 and 571 taxa indicate that MRF supertrees are, on average, equally or more similar to the input trees than matrix representation with parsimony (MRP) and modified min-cut supertrees. The results also show that large differences may exist between MRF and MRP supertrees and demonstrate that the MRF supertree method is a practical and potentially more accurate alternative to the nearly ubiquitous MRP supertree method.